A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation

Author

  • Guodong Zhou
Abstract

This paper proposes a chunking strategy to detect unknown words in Chinese word segmentation. First, a raw sentence is pre-segmented into a sequence of word atoms using a maximum matching algorithm. Then a chunking model is applied to detect unknown words by chunking one or more word atoms together according to their word formation patterns. In this paper, a discriminative Markov model, named the Mutual Information Independence Model (MIIM), is adopted for chunking. In addition, a maximum entropy model is applied to integrate various types of contexts and to resolve the data sparseness problem in MIIM. Moreover, an error-driven learning approach is proposed to learn useful contexts in the maximum entropy model. In this way, the number of contexts in the maximum entropy model can be significantly reduced without a decrease in performance, which makes it possible to further improve performance by considering more types of contexts. Evaluation on the PK and CTB corpora from the First SIGHAN Chinese word segmentation bakeoff shows that our chunking approach successfully detects about 80% of unknown words on both corpora and outperforms the best reported systems in unknown word detection by 8.1% and 7.1%, respectively.
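The first step described above, pre-segmentation into word atoms by maximum matching, can be illustrated with a minimal sketch. This is not the paper's implementation; the greedy forward-matching variant, the dictionary, and the maximum word length are assumptions for illustration only:

```python
def max_match(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word starting there; fall back to a single character."""
    atoms = []
    i = 0
    while i < len(sentence):
        # Try candidate lengths from longest to shortest.
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + j]
            if j == 1 or cand in dictionary:
                atoms.append(cand)
                i += j
                break
    return atoms
```

Out-of-vocabulary material falls out as single-character atoms, which is exactly what the subsequent chunking step can then group into unknown words.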


Similar resources

Chunking-based Chinese Word Tokenization

Abstract: This paper introduces a Chinese word tokenization system through HMM-based chunking. Experiments show that such a system can deal well with the unknown word problem in Chinese word tokenization. The second term in (2-1) is the mutual information between T and …. In order to simplify the computation of this term, we assume mutual information independence (2-2): …
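The equations referenced above are truncated in this excerpt. For orientation only, the mutual information independence assumption in the MIIM formulation is usually stated along the following lines; the tag sequence T_1^n and observation sequence O_1^n are assumed notation, not recovered from this snippet:

```latex
% Mutual information independence assumption (sketch):
PMI(T_1^n, O_1^n) = \sum_{i=1}^{n} PMI(t_i, O_1^n)

% which yields the MIIM scoring function:
\log P(T_1^n \mid O_1^n)
  = \sum_{i=2}^{n} PMI(t_i, T_1^{i-1})
  + \sum_{i=1}^{n} \log P(t_i \mid O_1^n)
```

The first sum captures tag-transition information and the second the observation-conditioned tag probabilities, which is what makes the model discriminative.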


Chinese Word Segmentation and Named Entity Recognition Based on a Context-Dependent Mutual Information Independence Model

This paper briefly describes our system in the third SIGHAN bakeoff on Chinese word segmentation and named entity recognition. This is done via a word chunking strategy using a context-dependent Mutual Information Independence Model. Evaluation shows that our system performs well on all the word segmentation closed tracks and achieves very good scalability across different corpora. It also show...


Chinese Unknown Word Identification Using Character-based Tagging and Chunking

Since written Chinese has no spaces to delimit words, segmenting Chinese text is an essential task. During this task, the problem of unknown words arises. It is impossible to register all words in a dictionary, as new words can always be created by combining characters. We propose a unified solution to detect unknown words in Chinese texts. First, a morphological analysis is done to obtain i...


Pruning False Unknown Words to Improve Chinese Word Segmentation

During unknown word detection in Chinese word segmentation, many detected word candidates are invalid. These false unknown word candidates deteriorate overall segmentation accuracy, as they also affect the segmentation accuracy of known words. We therefore propose to eliminate as many invalid word candidates as possible through a pruning process. Our experiments show that by cuttin...


Joint Word Segmentation, POS-Tagging and Syntactic Chunking

Chinese chunking has traditionally been solved by assuming gold-standard word segmentation. We find that accuracy drops drastically when automatic segmentation is used. Inspired by the fact that chunking knowledge can potentially improve segmentation, we explore a joint model that performs segmentation, POS-tagging and chunking simultaneously. In addition, to address the sparsity of full ch...




Publication year: 2005